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While a set of covariance matrices corresponding to different populations are unlikely to be 

^^ exactly equal they can still exhibit a high degree of similarity. For example, some pairs of vari- 

,_^ ables may be positively correlated across most groups, while the correlation between other pairs 

^^ may be consistently negative. In such cases much of the similarity across covariance matrices can 

be described by similarities in their principal axes, the axes defined by the eigenvectors of the 

covariance matrices. Estimating the degree of across-population eigenvector heterogeneity can 

be helpful for a variety of estimation tasks. Eigenvector matrices can be pooled to form a central 

set of principal axes, and to the extent that the axes are similar, covariance estimates for popu- 

"^ lations having small sample sizes can be stabilized by shrinking their principal axes towards the 

across-population center. To this end, this article develops a hierarchical model and estimation 

^~^ procedure for pooling principal axes across several populations. The model for the across-group 

I heterogeneity is based on a matrix-valued antipodally symmetric Bingham distribution that can 

CO flexibly describe notions of "center" and "spread" for a population of orthonormal matrices. 

O 

q 

l' Some key words: Bayesian inference, copula, Markov chain Monte Carlo, principal components, 

^^ random matrix, Stiefel manifold. 
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^ 1 Introduction 

X 

H Principal component analysis is a well-established procedure for describing the features of a co- 

variance matrix. Letting UAU be the eigenvalue decomposition of the covariance matrix of a 
p-dimensional random vector y, the principal components of y are the elements of the transformed 
mean-zero vector U (y — E[y]). From the orthonormality of U it follows that the elements of the 
principal component vector are uncorrelated, with variances equal to the diagonal of A. Perhaps 
more importantly, the matrix U provides a natural coordinate system for describing the orientation 
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of the multivariate density of y: Letting Uj denote the jth column of U, y can be expressed as 
y — E[y] = ziUi + • • • + ZpUp, where (zi, . . . , Zp) is a vector of uncorrelated mean-zero random 
variables with diagonal covariance matrix A. 

Often the same set of variables are measured in multiple populations. Even if the covariance 
matrices differ across populations, it is natural to expect that they share some common structure, 
such as the correlations between some pairs of variables having common signs across the populations. 



With this situation in mind, Flury 11984 developed estimation and testing procedures for the 



"common principal components" model, in which a set of covariance matrices {Si, . . . , T,k} have 
common eigenvectors, so that Sj = UAjU for each j G {!,..., K}. A number of variations of 
this model have since appeared: Flury [1987 and Schott |1991[ 1999| consider cases in which only 
certain columns or subspaces of U are shared across populations, and Boik [2002 describes a very 
general model in which eigenspaces can be shared between all or some of the populations. 

These approaches all assume that certain eigenspaces are either exactly equal or completely 
distinct across a collection of covariances matrices. In many cases these two alternatives are too 
extreme, and it may be desirable to recognize situations in which eigenvectors are similar but 
not exactly equal. To this end, this article develops a hierarchical model to assess heterogeneity 
of principal axes across a set of populations. This is accomplished with the aid of a probability 
distribution over the orthogonal group Op which can be used in a hierarchical model for sample 
covariance matrices, allowing for pooling of covariance information and a description of similarities 
and differences across populations. Specifically, this article develops a sampling model for across- 
population covariance heterogeneity in which 



p(U|A,B,V) 
Ui,...,U,^ 



c(A, B)etr(BU'^VAV^U) 
i.i.d. p(U|A,B,V) 
UfcAfcU^, 



(1) 



where A and B are diagonal matrices and V G Op. The above distribution is a type of general- 



ized Bingham distribution Khatri and Mardia, 1977 Gupta and Nagar, 2000 that is appropriate 



for modeling principal component axes. Section 2 of this article describes some features of this 
distribution, in particular how A and B represent the variability of {Ui, . . . , U/c} and how V 
represents the mode. Parameter estimation is discussed in Section 3, in which a Markov chain 
Monte Carlo algorithm is developed which allows for the joint estimation of {A,B, V} as well as 
{(Ujt, Afe), k = 1, . . . , K}. The estimation scheme is illustrated with two example data analyses in 
Sections 4 and 5. The first dataset, previously analyzed by Flury [1984 and Boik [2002 among 



others, involves skull measurement data on four populations of voles. Model diagnostics and com- 
parisons indicate that the proposed hierarchical model represents certain features of the observed 
covariance matrices better than do less flexible models. The second example involves survey data 



from different states across the U.S.. The number of observations per state varies a great deal, and 
many states have only a few observations. The example shows how the hierarchical model shrinks 
correlation estimates towards the across-group center when within-group data is limited. Section 
6 provides a discussion of the hierarchical model and a few of extensions of the approach. 

2 A generalized Bingham distribution 

The eigenvalue decomposition of a positive definite covariance matrix S is given by S = UAU , 
where A is a diagonal matrix of positive numbers (Ai, . . . , Xp) and U is an orthonormal matrix, 
so that U U = UU = I. Writing UAU = Yl^j=i ^j'^j'^J ^ ^^ see that multiplication of a 
column of U by -1 does not change the value of the covariance matrix, highlighting the fact that 
that the columns of U represent not directions of variation, but axes. As such, any probability 
model representing variability across a set of principal axes should be antipodally symmetric in the 
columns of U, meaning that U is equal in distribution to US for any diagonal matrix S having 
diagonal elements equal to plus or minus one, since U and US represent the same axes. 



Bingham 1974 described a probability distribution having a density proportional to exp{u Gu} 
for normal vectors {u : u^u = 1}. This density has antipodal symmetry, making it a candidate 



model for a random axis. Khatri and Mardia 1977 and Gupta and Nagar [2000 discuss a matrix- 



variate version of the Bingham distribution, 

p(U|G, H) oc etr(HU^GU) (2) 

where G and H aie pxp symmetric matrices. Using the eigenvalue decompositions G = VAV and 
H = WBW'^, this density can be rewritten as p(U|A,B, V,W) oc etr(B[W^U^V]A[V'^UW]). 
A well-known feature of this density is that it depends on A and B only through the differences 
among their diagonal elements. This is because for any orthonormal matrix X (such as V UW), 
we have 

tr([B + dI]X^[A + cI]X) = tr(BX^AX) + d x tr(X^AX) + c x tr(BX'^X) + cd x tr(X'^X) 

= tr(BX^AX) +dx tr(A) + c x tr(B) + cdp 

and so the probability densities p(U|A,B) and p(U|A + cI,B + dl) are proportional as functions 
of U and therefore equal. By convention A and B are usually taken to be non-negative. In what 
follows we will set the smallest eigenvalues Up and bp to be equal to zero. 

Although a flexible class of distributions for orthonormal matrices, densities of the form Q are 
not necessarily antipodally symmetric. However, we can identify conditions on G and H which 
give the desired symmetry. If either G or H have only one unique eigenvalue, then tr(HU GU) 
is constant in U and therefore trivially antipodally symmetric in the columns of U. Otherwise, we 
have the following result: 



Proposition 1. If G and H both have more than one unique eigenvalue, then a necessary and 
sufficient condition for ti(H.\J GU) to be antipodally symmetric in the columns o/U is thatH be 
a diagonal matrix. 

Proof. The symmetry condition requires that tr(HU GU) = tr(SHSU GU) for all U G Op and 
diagonal sign matrices S. If H is diagonal then SHS = H and so diagonality is a sufficient condition 
for antipodal symmetry. To show that it is also a necessary condition, first let V and A be the 
eigenvector and eigenvalue matrices of G. Then for any orthonormal X we have X = U V if 
U = VX . Therefore the symmetry condition is that 

tr([H - SHS]XAX^) = for ah X G Op and S G {diag(s), s G {±1}^} . 

If we let diag(S) = (—1, 1,1,...,!) then D = H — SHS is zero except for possibly Dm _ii and 

Dr_i 1], the first row and column of D absent di^i. Additionally, if not all entries are zero then D 

is of rank two with eigenvectors e and Se, where e = (|Hr ;^ ]^i|, /i2,i, /i3,i, . . . , /ip,i)/(\/2|Hr j^^^jl), 

and corresponding eigenvalues itd, where d = 2|Hr_i ii|. Writing the symmetry condition in terms 

of the eigenvalues and vectors of D gives 

p 
= tr(DXAX^) = d^Ojxf (ee^ - See^S) x^. 

The symmetry condition requires that this hold for all orthonormal X. Now if k and I are the 
indices of any two unequal eigenvalues of G, we can let x^ = e and x; = — Se, giving 

= tr(DXAX^) = (i[afe(l-e^See'^Se) + a;(e^See'^Se- 1)] 
= d{l-[e^Sef){ak-ai). 

Since a^ 7^ ai by assumption, this means that either d = or (e'^Se)^ = 1. Neither of these 
conditions are met unless all entries of D are zero, implying that the off-diagonal elements in the 
first row and column of H must be zero. Repeating this argument with the diagonal of S ranging 
over all p-vectors consisting of one negative-one and p — 1 positive ones shows that all off-diagonal 
elements of H must be zero. D 

Based on this results we fix the eigenvector matrix of H to be I and our column- wise antipodally 
symmetric model for U G Op is 

Pb(U| A, B, V) = c(A, B)etr(BU^VAV^U) (3) 

where A and B are diagonal matrices with ai > a2 > ■ ■ ■ > ap = 0, bi > b2 > ■ ■ ■ > bp = and 
V G Op. Interpreting these parameters is made easier by writing X = V U and expanding out 
the exponent of pB as 

p p p p 

tr(BU^VAV^U) = X] Z] '^ibji^I^jf = ^Y1 aibjxlj = a^(X o X)b, (4) 

1=1 j=l i=l j=l 



where "o" is the Hadamard product denoting element- wise multiphcation. The value of x? • de- 
scribes how close column i of V is to column j of U. Since both a and b are in decreasing order, 
aibi is the largest term and the density will be large when xf ^ is large. However, due to the 
orthonormality of X, a large xf ^ restricts xf 2 ^iid ^2 1 t° ^^ small, which then allows Xg 2 to be 
large. Continuing on this way suggests that the density is maximized if X o X is the identity, i.e. 
U = VS for some diagonal sign matrix S. 

Proposition 2. The modes of ps include V and {VS : S = diag(s),s £ {±1}^}. If the diagonal 
elements of A and B are distinct then these are the only modes. 

Proof. The matrix XoX is an element of the set of orthostochastic matrices, a subset of the doubly 
stochastic matrices. The set of doubly stochastic matrices is a compact convex set whose extreme 
points are the permutation matrices. Since every element of this compact convex set can be written 
as a convex combination of the extreme points, we have 

max a^(X o X)b < maxa^(V 9kPk)h 

for probability distributions Q over the finite set of permutation matrices. If the elements of a and 
b are distinct and ordered it is easy to show that a Pb is uniquely maximized over permutation 
matrices P by P = I, and so the right-hand side is maximized when Q is the point- mass measure 
on I. Since I is orthostochastic, the maximum on the left-hand side is achieved at (XoX) =1. D 

If two adjacent eigenvalues are equal then the density has additional maxima. For example, 
if h\ = 62 then a-^Zb is maximized over doubly stochastic matrices Z by any convex combination 
of the matrices corresponding to the permutations {1,2, 3, . . . ,p} and {2, 1,3,... ,p}. In terms of 
U, this would mean that modes are such that u^^Vj = ±1 for i > 2, with ui and U2 being any 
orthonormal vectors in the null space of {V3, . . . , Vp}. More generally, how the parameters (A, B) 
control the variability of U around V can be seen by rewriting the exponent of (p| a few different 
ways. For example, equation Q, can be expressed as 

V P 

tr(BU^VAV^U) = ^^aibj{vjujf 

From equations (4 1 and (5) it is clear that bj = bj-^-i implies that Uj = Uj+i and Oj = Oj+i implies 
y^Juj = \rf^-^Uj. In this way the model can represent eigenspaces of high probability, not only 



eigenvectors, providing a probabilistic analog to the common space models of Flury 11987 . To 



illustrate this further. Figure [ij shows the expectations of the squared elements of X = V U for 
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Figure 1: Expected values of the squared entries of X = V U under two different Bingham 
distributions. Light shading indicates high values. 



two different values of (A, B) (calculations were based on a Monte Carlo approximation scheme 
described in Hoff| |2007b| ) . The plot in the first panel is based on the generalized Bingham distribu- 
tion in which diag(A) = diag(B) = (7,5,3,0,0,0). For this distribution, ui, U2 and U3 are highly 
concentrated around vi, V2 and V3 respectively. Since 64 = 65 = be, the vectors U4, U5 and ug 
are equal in distribution and close to being uniformly distributed on the null space of (vi, V2,V3). 
These particular values of (A, B) could represent a situation in which the first three eigenvectors 
are conserved across populations but the others are not. The second panel of Figure [T] represents a 
more complex situation in which diag(A) = (7, 5, 3, 0, 0, 0) and diag(B) = (7, 7, 0, 0, 0, 0). For these 
parameter values the following components of X = V U are equal in distribution: columns 1 and 
2; columns 3, 4, 5 and 6; rows 2 and 3; rows 4, 5 and 6. Such a distribution might represent a 
situation in which a vector vi is shared across populations, but it is equally likely to be represented 
within a population by either ui or U2. 



3 Pooled estimation of covariance eigenstructure 

In the case of normally distributed data the sampling model for S^ = Y^ (I ^11-^) Y^, the 

observed sum of squares matrix in population k, is a Wishart distribution. Combining this with 
the model for principal axes developed in the last section gives the following hierarchical model for 
covariance structure: 

Ui, . . . , U;^ ~ i.i.d. p(U| A, B, V) (across-population variability) 

Sfc ~ Wishart (UyfcAfcU^ ,nfc — 1) (within-population variability) 



The unknown parameters to estimate include {A, B, V} as well as the within-population covariance 
matrices, parameterized as {(Ui, Ai), . . . , (\Jk,-^k)}- In this section we describe a Markov chain 
Monte Carlo algorithm that generates approximate samples from the posterior distribution for these 
parameters, allowing for estimation and inference. The Markov chain is constructed with Gibbs 
sampling, in which each parameter is iteratively resampled from its full conditional distributions. 
We first describe conditional updates of the across-population parameters {A, B, V}, then describe 
pooled estimation of each U^, which combines the population-specific information S^ with the 
across-population information in {A, B, V}. 

3.1 Estimation of across-population parameters 

Letting the prior distribution for V be the uniform (invariant) measure on Op, we have 

K 



p(V|A,B,Ui,...,U;^) a p{Y)llp{lJk\A,B,\) 

k=l 
K 

oc etr(J^BU^VAV^U,,) 
fc=i 

K 

= etr(^AV^UfeBUfV) =etr(AV^[^UfeBU^]V), 



k=l 



and so the full conditional distribution of V is a generalized Bingham distribution, of the same 
form as described in the previous section. Conditional on the values {A, B, Ui, . . . , U/^}, pairs of 
columns of V can be sampled from their full conditional distributions using a method described in 



Hoff 2007b 



Obtaining full conditional distributions for A and B is more complicated. The joint density of 
{Ui, . . . , \Jk} is c(A, B) etr(^ AV U^BU^ V). From the terms in the exponent we have 

K K p p 

tr(J^BU^V^AV^Ufe) = ^tr(BU^VAV^Ufc) = ^ J^ ^ ai6,(vfu,-fc)2 = a^Mb 

fe=l fc=l i=l i=l 

where M is the matrix M = X]fe=i(V U^) o (V Ufc). The normalizing constant c(A, B) is equal 
to oFq{A,'B)~^, where oFq(A,'B) is a type of hypergeometric function with matrix arguments 
[Herz 1955] . Exact calculation of this quantity is problematic, although approximations have 



been discussed in Anderson [1965 , Const antine and Muirhead [1976 and Muirhead [1978 . The 



first-order term in these approximations is 

c(A,B) ^ c(A,B) = 2-P7r-(2)e-^^^ J|(a, - aj)^/\bi - hj)^/^ . 



i<j 



This gives the following approximation to the likelihood for A,B: 

p(Ui,...,Ui^|A,B,V) w c(A,B)^etr(a^Mb) 

oc exi>{-Si^{KI-M)h}Y[{ai-aj)^/\bi-bj)^/^ (6) 

i<j 

However, there is an identifiability issue with this likelihood: As seen above, since A and B are 
diagonal, tr(BX"^AX) simplifies to a^(X o X)b = J2i J2j ^i^j^l j- This means that for any c > 0, 
p(U| A, B, V) = p(U|cA, c^-'^B, V), and so the scale of A and B are not separately identifiable. To 
account for this, we parameterize A and B as follows: 

diag(A) = (ai,...,ap) = v^(ai,. . . ,ap) 
diag(B) = {bi, . . . ,bp) = y/w{l3i, . . . , I3p) 

where t(; > 0, 1 = ai > a2 > • • • > "p-i > Op = and 1 = /3i > /32 > • • • > /5p-i > /3p = 0. 
Rewriting d6| in terms of these parameters and multiplying by a prior distribution p{oL, (3, w) gives 

p(q,/3,w|M) o^p{a,(3,w) x exp{-wQ^(is:i - M)/3}w;(2)^/2 JJ(ai - aj)^^^(A - /?j)^/^. (7) 

In what follows we take the prior distribution p{cx,P,w) such that 1 > a2 > " " " > Op-i > and 
1 > /?2 > • • • > Pp~i > are two independent sets of order statistics of uniform random variables 
on [0, 1], and w has a gamma distribution. With these priors, the values of a and /3 can be sampled 
from their full conditional distributions on a grid of [0, 1], and the full conditional distribution of 
u; is a gamma distribution. For example, if w ~gamma(r7o/2,rg/2) a priori, then p(u;|a,/3,M) is 
gamma(r?o/2 + (f)i^/2, r]oT^/2 + «^(KI - M)^). 

We should keep in mind that this full conditional distribution is based on an approximation to 
the normalizing constant c(A, B) (although it is a "bona fide" full conditional distribution under 
the prior p(A,B) = p(A, B)[c(A,B)/c(A, B)]). Results from Anderson 1965 show that in terms 
of the parameters {a, (5, w}, 

'l^r'^i,T.\(-'-m-m-'^oil,) (8) 

Further terms in the expansion are available in Anderson [1965 . In problems where w is large then 



the approximation is likely to be a good one. In cases where the differences between consecutive 
aj's or /3j's is small compared to 1/w then it may be desirable to correct for the approximation 
using a few additional terms from ([8]) via a Metropolis-Hastings procedure. For example, letting 
/i(q!,/3, w) = 1 + X]i<J4tL'(aj — Cij){[3i — I3j)]~^, if Ws is the current value of w in the Markov chain 
and w is sampled from the approximate full conditional distribution of w based on ([7|, then the 
correction can be implemented as follows: 



1 . sample a proposal w from the full conditional of w based on ([7| ; 

2. sample u ~ uniform(0,l) and compute r = [h{ot, f3,w)/h{a,P,Ws)]^', 

3. if n < r then set Wg+i = w, otherwise set Wg+i = Ws- 

A similar procedure using one more order in the expansion is used in the example data analyses of 
the next section. 



Finally, we note that there has been some recent progress in computing qFq{A, B) exactly. Koev 



and Edelman [2006 provide an algorithm that is fast enough to be used in MCMC algorithms for 



problems in which p roughly 5 or less and the values of A and B are not too large. For other 
problems the approximations based on ([T]) and ([8| still seem necessary. 

3.2 Estimation of population-specific principal axes 

As described above, the within-population sampling model for the sample sum-of-squares matrix 
Sfc is Wishart(UfcAfcU^ , n^ — 1), so that as a function of S^, Ufc and A^, 

p(Sfc|Ufc, Afc) oc |Afcr("''-i)/2|S,|("^-P-2)/2 X etr(-^A,iU^S,Ufc). (9) 

In the absence of information from other populations, a uniform prior distribution on U^ would 
5deld a generalized Bingham full conditional distribution for U^. However, combining © with the 
across-population information p(Ufc| A, B, V) gives a non-standard full conditional distribution for 
Ufc: 



p(Ufc|Sfc,Afc,A,B,V) oc p(Sfe|Ufc,Afc)xp(Ufc|A,B,V 

1 
2 



oc etr(--A^^U^SfcUfc) x etr(BU^VAV^Ufc). 



The terms in the exponents are difficult to combine as they are both quadratic in Ufc. Writing out 
the expression in terms of the columns {ui^^, . . . , Up_fc} yields some insight: 

1 ^ / 1 \ 

tr(BU^VAV^Ufc - -A^'ulSkVk) = Y. ""h b,\AV^ - -XjlSk u,,k 

This suggests that the full conditional distribution of the jth column vector of U^ is a vector- 
valued Bingham distribution. This is true in a very limited sense: Since U^ is an orthonormal 
matrix the full conditional distribution of Uj^k given the other columns of Ufc must have support 
only on ibu, where u represents the null space of the vectors {ui^^, . . . ,Uj_i fc, Uj+i fc, . . . ,Up_fc}. 
Iteratively sampling the columns of Ufc from their full conditional distributions would therefore 
produce a reducible Markov chain which would not converge to the target posterior distribution. 
One remedy to this situation, used by Hoff [2007b in the context of sampling from the Bingham 



9 




distribution, is to sample from the full conditional distribution of columns taken two at a time. 
Conditional on {us^fc, . . . ,Up^fc}, the vectors {uijj,U2.a;} are equal in distribution to NZ, where N 
is any p x 2 dimensional orthonormal basis for the null space of {U3 fc, . . . , Up ^j and Z is a random 
2x2 orthonormal matrix whose density with respect to the uniform measure is proportional to 

p(Z) oc etr (z]^ Gzi + Z2 HZ2) , 

where G = N^(&iVAV'^-A-jSfc)N, H = N^(62VAV^- A-jSfc)N and zi and Z2 are the columns 
of Z. Since Z is orthogonal, we can parameterize it as 

s sin cf) \ 
—scoscj) J 

for some (j) £ (0, 2tt) and s = ±1. The uniform density on the circle is constant in (p, so the joint 
density of {(p, s) is simply p(Z((/», s)). Sampling from this distribution can be accomplished by first 
sampling S (0, 27r) from a density proportional to 

p{(j)) cxexp([5ri,i + /i2,2] cos^ (/) + [/ii,i +52,2]sin^0+ [gi^2 +92,1 - /ii,2 - /i2,i] cos sine/)), 
and then sampling s uniformly from { — 1, +1}. 

3.3 Estimation of eigenvalues 

From ([9| we see that the conditional distribution of A^ given U^ and S^ has the following form: 

p(Afc|Ufc,Sfc) oc p(Afc)|Afcr("'=-i)/2etr(-iA^iufSfcUfc) 

The part not involving the prior distribution has the form of an inverse-gamma density, and indeed, 
if p(Afc) were the product of inverse-gamma densities with parameters {1^0/2, UQaQ/2) then the full 
conditional distribution of Ajjt would be inverse-gamma [(I'o + n — l)/2, (uQa^ + uJj^SkUj^j.)/2]. 
However, it may be desirable to add more structure to the estimation of the eigenvalues. In 
usual one-sample principal component analysis the eigenvalues are labeled in order of decreasing 
magnitude and attention is focused on the "first few" eigenvectors, i.e. those corresponding to the 
largest eigenvalues. In terms of making comparisons of eigenvectors across groups, restricting the 
eigenvalues to be ordered means that the ordered columns of V refer to the ordered columns of 
U. One concern about such a restriction would be how it might affect inference in the case of 
a shared eigenvector that is the first principal axes in some groups, and possibly the second or 
third in other groups. This sort of heterogeneity can in fact be represented with the generalized 
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Bingham distribution even if the eigenvalues are order-restricted. For example, the distribution 
with diag(A) = (a, 0, 0, . . .) and diag(B) = (6, b,0, . . .) represents a population in which, with equal 
frequency, one of the first two columns of U is near the first column of V. Because of this flexibility, 
in what follows we estimate the eigenvalues in each group as being ordered. A convenient prior 
distribution is that p(Afe) is the product of inverse-gamma densities described above, but restricted 
to the space Xi^k > X2,k > ■ ■ ■ > Xp^k- The full conditional distribution of Xj^k is then inverse- 
gamma[(i/o + n — l)/2, (t'oCg + uJj^SfcUj^fc)/2] but restricted to the interval (Aj+i^fc, Aj_i^fc). 

3.4 Summary of MCMC algorithm 

The unknown parameters in the hierarchical model are the group-specific eigenvectors and values 
{Ui,Ai}, . . . ,{JJk,-^k} and the parameters {A,B, V} describing the across-group heterogeneity 
of eigenvector matrices. The diagonal matrices A and B are parameterized as 



diag(A) = {ai,...,ap) = ^/w{al,...,ap) 
diag(B) = {bi,...,bp) = Vw{(3i, . . . , f3p) 

with 1 = cti > • • • > Op = and 1 = f3i > ■ ■ ■ > (3p = 0. Convenient prior distributions are V ~ 
uniform Op, (02, • . . , Op-i) and (P2, ■ ■ ■ , Pp-i) are uniform on [0, 1] subject to the ordering restric- 
tion, w ~ gamma(?7o/2, rQ/2) and (l/Ai^^, . . . , 1/Xp^k) are the order statistics of a sample from a 
gamma(fo/2, (Tq/2) distribution. With these prior distributions, a Markov chain in the unknown pa- 
rameters that converges to the posterior distribution p({Ui,Ai}, . . . , {JJkj-^k}, A, B, V|Yi, . . . , Y^) 
can be constructed by iteration of the following sampling scheme: 

1. Update the within-group parameters: 

(a) Update {Ui, . . . , Vk}'- For each k and a randomly selected pair {ji,J2} C {1, . . . ,p}; 

1. let N be the null space of the columns {uj^ : j {ji,J2}}', 
ii. compute G = N^(6,,VAV^ - AT^|^Sfc)N, H = N^(6,-,VAV^ - Xj^]^Sk)N; 
iii. sample Z = (zi,Z2) G O2 from the density proportional to exp(z{'Gzi -|- z|'Hz2) 
iv. set Ujj^fc to be the first column of NZ and Uj^^a: to be the second. 

(b) Update {Ai, . . . ,Ak}'- Iteratively for each j £ {!,... ,p} and k G {1, . . . , K}, sample 
Xj^k ~inverse-gamma[(z/o + ^fc — l)/2,(foO'o + u^fcS,fcUj^fc)/2], but constrained to be in 

(^j-l,fc) ^j+l,kj- 

2. Update the across-group parameters: 

(a) Update V: Sample V from the Bingham density proportional to etr(AV [^ U^BU^JV). 

(b) Update w, a and (3: Compute M = Ef=i(V^Ufc) o (V^Ufe) and 
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i. sample w ~ gamma(7?o/2 + (2)^/2, Voro/2 + (X^IKI - M]/3) 
ii. for each i £ {2, . . . ,p — 1} sample Oj G (oi-i, ckj+i) from the density proportional to 

exp{-ai(u;/3^M[,,])} Uj-.j^i l«. " Ojl''/'- 
iii. for each j £ {2, . . . ,p — 1} sample Pj S (/?j_i,/3j+i) from the density proportional 

to exp{-/3,>Mf;.]a)} 0,.^, \Pj - Al^/'- 

As discussed in section 3.1, it may be desirable to make Metropolis-Hastings adjustments to the 
steps in 2(b) to account for the approximation to the normalizing constant c(A, B). Functions and 
example code for this algorithm, written in the R programming environment, are available at my 



website, http : //www . stat . Washington . edu/hof f / , 



4 Example: Vole measurements 



Flury [1987 describes an analysis of skull measurements on four different groups of voles. The four 
groups, defined by species and sex, are male and female Microtus californicus and male and female 
Microtus ochrogaster, having sample sizes of 82, 70, 58 and 54 respectively. Flury provides the 
sample covariance matrices of four log-transformed measurements corresponding to skull length, 
toothrow length, cheekbone width and interorbital width. The eigenvectors of the four empirical 
covariance matrices are given in Table [T] The first eigenvector in each group can roughly be 
interpreted as measuring overall size variation, and its values seem fairly similar across groups. 
The remaining eigenvectors also display a high degree of similarity across groups. By performing 
statistical tests of various hypotheses regarding the four population covariance matrices, Flury 
concludes that although the sample covariance matrices appear similar, there is enough evidence 
to reject exact equality of the population matrices. Furthermore, Flury rejects a hypotheses that 
the population covariances are proportional to each other, and then suggests a model in which 
the the covariance matrices share a single eigenvector (interpreted as corresponding to size), with 
the remaining eigenvectors and all of the eigenvalues being distinct across groups. In this section 
we reanalyze these data using the hierarchical eigenmodel discussed above, and compare it to the 



model in Flury [1987 and a few others. In particular, we show that allowing information-sharing 
across the groups where appropriate, but not forcing any of the eigenvectors to be exactly equal, 
results in a model that better represents features of the observed dataset. 



Using the sample sizes and sample covariance matrices provided in Flury [1987 , centered ver- 



sions of Y^ Yfc for each group k G {1, ... ,4} were reconstructed and used as data for the model 
described in Section 3. The prior distribution of w was taken to be a diffuse exponential with a 
mean of 1000, and the prior distribution for the inverse-eigenvalues was exponential with a mean 
of 1. A Markov chain consisting of 10,000 iterations was constructed, for which parameter values 
were saved every 10th iteration giving a total of 1000 posterior samples for each parameter. Mixing 
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M. californicus 










males 






females 




36.31 


27.01 


8.05 


2.78 


52.44 


21.14 


3.75 


3.17 


0.49 


-0.31 


-0.19 


0.79 


0.53 


-0.31 


-0.29 


0.74 


0.60 


-0.10 


0.76 


-0.24 


0.56 


-0.06 


0.82 


-0.11 


0.55 


-0.12 


-0.63 


-0.54 


0.57 


-0.14 


-0.48 


-0.65 


0.30 


0.94 


-0.06 


0.17 


0.30 


0.94 


-0.11 


0.14 








M. ochrogaster 










males 






females 




36.30 


9.67 


7.97 


2.80 


35.61 


12.35 


8.32 


3.38 


0.58 


0.04 


-0.38 


0.71 


0.56 


0.06 


0.05 


0.82 


0.45 


0.72 


0.51 


-0.13 


0.47 


0.02 


0.80 


-0.38 


0.51 


-0.08 


-0.51 


-0.69 


0.66 


-0.25 


-0.58 


-0.40 


0.45 


-0.68 


0.58 


-0.02 


0.13 


0.97 


-0.17 


-0.14 



Table 1: Eigenvalues and eigenvectors of empirical covariance matrices. Eigenvalues are given in 
the first row for each group. 

of the Markov chains was monitored via a variety of parameter summaries computed at each saved 
iteration. For example, for each saved value of A and B the average and standard deviation of the 
logs of the (p— 1) X (p— 1) = 9 non-zero values of A o B were obtained and plotted sequentially 
in the first panel of Figure [2} 

A posterior point estimate of V can be obtained from the eigenvector matrix of the posterior 
mean of VAV , obtained by averaging across samples of the Markov chain. This produces the 
matrix in Table |2] which is nearly identical to the eigenvector matrix of the pooled covariance matrix 
^ Y^Yfc/(nfc — 1), which is also given in the table. Posterior mean estimates of the eigenvalue 
matrices {Ai, . . . , A^} were all within 1.0 of their corresponding values based on the the empirical 
covariance matrices. 

Table [T] suggests that the first and fourth eigenvectors are the most preserved across groups, 
whereas the other two are less well-preserved. Letting U^ be the eigenvector matrix of Y^^, Y^ and 
V the eigenvector matrix of Yl^k^k/{nk — 1), the differential heterogeneity of the eigenvectors 
can be described numerically by computing the value of diag(V Ufc)^ and averaging each of the 
p entries of this vector across the K groups. This p-dimensional function of the observed data 
gives i(YfYi,...,YjY4) = (0.98,0.85,0.86,0.96), indicating that by this metric the first and 
last eigenvectors are most preserved across groups. To examine how well the model represents this 
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hierarchical model estimate empirical estimate 

0.54 -0.27 -0.19 0.77 0.55 -0.25 -0.17 0.78 

0.54 -0.10 0.80 -0.22 0.53 -0.10 0.81 -0.23 

0.56 -0.15 -0.56 -0.59 0.57 -0.16 -0.56 -0.57 

0.30 0.95 -0.06 0.10 0.30 0.95 -0.05 0.08 

Table 2: Model-based and empirical estimates of V. 

observed heterogeneity, the value of t(Yf Yi, . . . , Y4 Y4) can be compared to its posterior predictive 
distribution under the model. This was done by generating simulated values Yi Yi, . . . , Y4 Y4 
every 10th iteration of the Markov chain and computing the statistic t{) described above, resulting 

~ T ~ ~ T ~ 

in 1000 samples of t(Yi Yi, . . . , Y4 Y4) from the posterior predictive distribution. For simplicity 
we present below only the minimum and maximum values of this statistic, which for our observed 
data are (0.85, 0.98). 

The top of the second panel of Figure [2] shows the posterior predictive distributions of this 
statistic on a logit scale under the hierarchical model which pools information across eigenvector 
matrices. The observed values are well within the predicted range, indicating that the model is 
able to represent the differential amounts of eigenvector preservation among the observed covariance 
matrices. In contrast, the lower part of the plot shows a posterior predictive distribution generated 
under a one-shared-eigenvector model similar to the one in Flury [1987 , obtained obtained by 
fixing w = 1000, qi = /3i = 1 and aj = /3j = for j > 1. This model accurately predicts 
the highest degree of preservation across eigenvectors, but underestimates the preservation among 
other eigenvectors. This is not surprising, as this model shares information only across a single 
eigenvector. 

Lastly we fit two other related models: a "no pooling" model in which no information was shared 
across groups and a common covariance matrix model in which it is assumed that the covariances 
are exactly identical across groups. Not surprisingly these two models under- and over-represent 
the similarity across eigenvectors of observed covariance matrices, as shown in the third panel of 
Figure [2j Taken together, these results indicate that assuming complete equality, or completely 
ignoring similarity, can misrepresent the variability of covariance structure across groups. 

5 Example: National Health Communication Study 



The 2005 Annenberg National Health Communication Survey (anhcs. asc.upenn.edu I gathered 
self-reported health and lifestyle data from 2,989 members of the adult U.S. population under the 
age of 65. Among the variables recorded were the following: 
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Figure 2: MCMC and model diagnostics for the Vole example. The first panel shows averages 
(black) and standard deviations (gray) of the log entries of A o B at every 10th scan of the Markov 
chain. The second and third panels show posterior predictive distributions of the minimum and 
maximum similarity statistics under a variety of models, with the observed value of each statistic 
represented by a vertical line. 

state: state of residency (including the District of Columbia) 

f ruitveg: typical number of servings of fruit and vegetables per day 

exercise: typical weekly frequency of exercise 

bmi: body mass index 

alcohol: number of days in the month consuming five or more alcoholic drinks 

smoke: typical number of cigarettes smoked per day 

age: in years 

female: indicator of being female 

income: household income (19 ordered categories) 

edu: education level (no degree, high school, some college. Bachelor's degree or higher) 

In this section we estimate state-specific correlation matrices in a Gaussian copula model for the 
p = 9 ordinal variables above. More specifically, we model the observed data vectors yi, . • • ,y„ 
within a particular state as monotone functions of latent Gaussian random variables, so that 

zi, . . . , z„ ~ i.i.d. multivariate normal(0, S) 
Vij ^ 9j\^i,j)- 
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The non-decreasing functions {gi, ■ ■ ■ ,gp} are state-specific as is the covariance matrix S. We 
compare two models for Si, ... , Sx, the first being one in which no information is shared and 
Ui, . . . ,Ux are a priori independent and uniformly distributed on Op. The second is the hierar- 
chical eigenmodel in which Ui,...,Ui^ ~ i.i.d. pb(U|A,B, V), with the parameters {A,B,V} 
unknown and estimated from the data, using the same prior distributions as in the previous 
section. For both models, the prior distribution on the eigenvalues in each group is such that 
{1/Ai < • • • < 1/Ap} are the order statistics of p independent exponential(l) random variables. 
We note that both of these models ignore the possibility that heterogeneity in correlation matrices 
might be associated with state-specific characteristics such as population size or geographic location 
(although some ad- hoc exploratory analyses suggest these effects are small). 

Parameter estimation for this hierarchical copula model can be accomplished by iterative sam- 
pling of the parameters from their full conditional distributions as in Section 3 with the latent 
z's taking the roles of the observed y's, along with iterative sampling of the z's from their full 
conditional distributions (which are constrained normal distributions). This latter step is de- 
scribed for a one-group discrete-data copula model in Hoff [2007a . We note that this is a type 



of parameter-expanded estimation scheme Gelman et al. , 2008 in that the scale of each variable 



j can be represented by both gj and T,jj, and so these quantities are not separately identifiable. 
However, the posterior distribution of {T,i, . . . ,T,k} induces a posterior distribution over state- 
specific correlation matrices {Ci, . . . , Ck}, which are the parameters of primary interest in copula 
models. In the posterior analysis that follows we focus mostly on comparing the hierarchical and 
non-hierarchical posterior mean estimates of the state-specific correlation matrices. 

Markov chains consisting of 105,000 iterations were constructed for each of the two models, 
with results from the first 5000 iterations being discarded to allow for burn-in. Parameter values 
from the remaining iterations were saved every 50th iteration, leaving 2000 Monte Carlo samples 
for approximating the posterior distributions. The correlation parameters mixed reasonably well: 
In the hierarchical model, the effective sample sizes for 90% of the parameters was greater than 
500, and for 50% it was greater than 1200 (effective sample size is an estimate of the number of 
independent samples required to estimate the mean to the same precision as with a given auto- 
correlated sample). Mixing of the hierarchical parameters was slower: The first panel of Figure^ 
plots the mean and standard deviation of the logs of the 64 non-zero values of the matrix A o B 
at every 50th scan of the Markov chain for the hierarchical model. The effective sample sizes for 
these two functions of the parameters were both just over 100. 

The second panel of Figure [5] plots the first two eigenvectors of the posterior mean of the state- 
averaged correlation matrix X^,t=i ^k/K for each of the two models. The results are quite similar, 
indicating that the main correlations across states are described by smoking and drinking behavior 
being negatively correlated with education level, income, fruit and vegetable intake and exercise. In 
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Figure 3: The first panel shows averages (black) and standard deviations (gray) of the log entries 
of A o B. The second panel shows the first two principal component loadings for the hierarchical 
(black) and non-hierarchical (gray) models. 
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Figure 4: Posterior mean estimates of state-specific correlations. The top row are from the non- 
hierarchical model, the bottom from the hierarchical eigenmodel. 
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terms of state-specific correlation matrices however, the two models produce quite different results: 
The top and bottom plots of Figure [5] give posterior mean estimates of state-specific correlations 
from the non-hierarchical and hierarchical models, respectively. For each pair of variables, a boxplot 
indicates the across-state heterogeneity among the K = 51 parameter estimates for each correlation 
coefficient. The number of observations per state varies widely, with North Dakota and the District 
of Columbia having 3 respondents each, whereas Texas and California have 225 and 312 respondents 
respectively. Generally speaking, the estimates for states with lower sample sizes appear at the 
extremes of the boxplots, which is not surprising as these estimates have a higher degree of sampling 
variability. Also of note is the fact that there is no coefficient that is estimated as either positive 
across all states or negative across all states. For example, among the highest correlations is that 
between income and education, with an across-state median estimate of 0.28 based on the non- 
hierarchical model. However, there were four states (VT, AK, WY, NE) which were estimated as 
having a negative correlation between these variables. The sample sizes from these states were 5, 
9, 5 and 15 respectively, suggesting that these low correlations may be due to unrepresentative 
samples. In contrast, the hierarchical model recognizes that much of the across-state heterogeneity 
in correlation estimates may be due to sampling variability, and shrinks estimates from low-sample 
size states towards the across-state center. For example, the hierarchical model gives positive point 
estimates for the correlation between income and education for all of the states, including VT, AK, 
WY and NE. As shown in the lower half of Figure |5] across-state heterogeneity among the other 
correlation coefficients is similarly reduced, with nearly two-thirds (23 of 36) of the correlation 
coefficients having sign-consistent estimates across all 51 states. 

The effects of hierarchical estimation are explored further in Figure [5] We have two estimates 
of the eigenvectors for each of the k state-specific correlation matrices: U^ from the hierarchical 
model and U^ from the non-hierarchical model. We can compute a similarity between these two 
matrices as the average of the p values of diag(U;j,Ufc)^. The ffist panel of Figure [sj shows that 
the relationship between the similarity and the within-state sample size is positive as expected: 
Covariance matrices for states with large sample sizes are well-estimated based on within-state 
data alone, and their eigenvector estimates are largely unaffected by hierarchical estimation. In 
contrast, the amount of information from states with low sample sizes is small, and so the estimates 
for the hierarchical model are pulled towards the population mode and away from U^. The effects of 
this shrinkage on the principal axes of the correlation matrices are shown graphically in the second 
and third panels of Figure [5} The second panel plots the projections of the ffist two columns of each 
Ufc onto the ffist two columns of the eigenvector matrix of the pooled correlation matrix. Although 
heterogeneous, the vectors are generally in the same direction, and further inspection shows that 
outliers tend to be states with low sample sizes. The third panel of the figure shows the same plot 
for the projections of the columns of each Ufc from the hierarchical model. The heterogeneity here 
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Figure 5: Effects of shrinkage on the estimated principal axes. The first panel shows the similarity 
between non-hierarchical and hierarchical estimates as a function of sample size. The second and 
third show heterogeneity across the first two principal axes in the non-hierarchical and hierarchical 
models, respectively. 

represents the estimated across-state variability in eigenvectors after accounting for the within- 
state sampling variability. The axis in this plot that is furthest from the center is that representing 
Wisconsin, which has relatively high sample size of 69 but some extreme correlations: For example, 
among states with sample sizes greater than 20, Wisconsin has the lowest non-hierarchical estimate 
of the correlation for (income, education) and (female, bmi), and the highest non-hierarchical 
estimate of the correlation for (income, alcohol). These correlations make Wisconsin somewhat 
of an outlier in terms of the correlations represented by the first two principal components. The 
relatively large sample size for Wisconsin suggests these extreme correlations cannot be solely 
attributed to within-state sampling variability, and this is refiected in the state-specific estimated 
correlation matrix from the hierarchical model. 

6 Discussion 

As an alternative to the Bingham distribution, a simpler model for across-group covariance hetero- 
geneity would be that Si, ... , Yjk are i.i.d. samples from an inverse- Wishart distribution. For many 
applications however, such a model may be too simple: The inverse- Wishart distribution has only 
one parameter to represent heterogeneity around the mean covariance matrix, and cannot represent 
differential amounts of eigenvector heterogeneity as the Bingham distribution can. Additionally, 
the inverse- Wishart model cannot distinguish between across-group eigenvector heterogeneity and 
across-group eigenvalue heterogeneity, as these quantities are modeled simultaneously. 

As we are pooling eigenvector information across groups it is natural to consider pooling eigen- 



19 



values as well. This would entail modeling {Ai, . . . , A^^} as being samples from a common popu- 
lation, and estimating the parameters of this population using the data from all K groups. One 
simple approach to doing this would be to estimate the parameters {i'o,aQ) in the prior distribu- 
tion for the eigenvalues, thus treating the distribution as a sampling model. As with the other 
unknown parameters, this can be done by iteratively updating these parameters based on their 
full conditional distributions. Straightforward calculations show that a gamma prior distribution 
for (Tq results in a gamma full conditional distribution. The full conditional distribution for vq is 
non-standard, but if vq is restricted to the integers then its full conditional distribution can easily 
be sampled from. 

Another possible model extension is to situations where the number of variables is larger than 
any of the within-group sample sizes. In these cases, full-rank covariance estimation can become 
unstable and computationally intractable. A remedy to this problem is to use a factor analysis 
model, in which a covariance matrix S^ is assumed equal to UfcDfcU;. +'^fc-'-) where D^ is a positive 
diagonal matrix and U^ is a p x r orthonormal matrix with r < p, an element of the Stiefel manifold 
Sr,p- As before, heterogeneity across covariance matrices can be expressed by heterogeneity in these 
matrix components, and a Bingham model on Sr,p, similar to the one used in this paper, can be 
used to express heterogeneity among Ui, . . . , Ui<-. 

Computer code and data for the examples in this article are available at 



http : //www . Stat . Washington . edu/hof f / 
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